Skip to content

Conversation

@julienledem
Copy link
Member

This pull request provides an API to read thrift in a streaming fashion.
This enables ignoring fields that are not needed without loading them into memory.
It also aloow treating the data as it comes instead of when it's fully loaded in memory.

dvryaboy pushed a commit to dvryaboy/incubator-parquet-format that referenced this pull request Aug 24, 2014
Author: Julien Le Dem <julien@twitter.com>

Closes apache#8 from julienledem/master and squashes the following commits:

aa0d751 [Julien Le Dem] correct mailling-list
285b39c [Julien Le Dem] Update CONTRIBUTING.md
bca9bc4 [Julien Le Dem] Create CONTRIBUTING.md
@julienledem julienledem changed the title add a streaming Thrift API, to enable processing the metadata as we read it and skipping unnecessary fields. PARQUET-79: add a streaming Thrift API, to enable processing the metadata as we read it and skipping unnecessary fields. Aug 25, 2014
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a comment: The delegatingFieldConsumer delegates addField call to corresponding TypedConsumer?

Conflicts:
	src/main/java/parquet/format/Util.java
@julienledem
Copy link
Member Author

@tsdeng I refactored the APIs a bit, added javadoc and resolved conflicts.

@dvryaboy
Copy link
Contributor

I don't want to block this PR, but I do want to comment on the approach...

While this will likely address the immediate problem (memory pressure when creating splits), having this giant thrift object in a footer is still an issue. For one thing, we have to write it -- the committer that writes the joined footer is probably also under memory pressure since it has to have the merged thrift object in memory. Also, this makes reading just one file from a directory that has 400 unnecessarily expensive (as expensive as initializing a read for all 400 files).

The joined footer doesn't need to be a single object and we should probably move to a model where it's a sequence of objects, one per file, and we can choose which file's metadata we are reading.

@julienledem
Copy link
Member Author

Agreed.
I think one approach will be to rework the metadata a bit so that we don't use the column names as identifiers but rather the column indices (the schema is already in there anyway).

The committer does only one partition at a time so it is much more manageable than when reading many partitions at once. (we had issues with a schema with thousands of columns reading dozens of partitions at a time).

One thrift containing a list of Thrift object versus a sequence of objects, doesn't make any difference in my opinion as we can easily handle the top level object without having to materialize the list as in this PR. This could be applied in the committer for write as well so that it would not load everything in memory.

I'm also proposing a new _common_metadata file, so that we don't have to skip the row groups when we don't need them:
https://github.com/apache/incubator-parquet-mr/pull/45/files#diff-e3b91b40045ac64f6dfd6318453af9c2R319

@julienledem
Copy link
Member Author

@dvryaboy Also if you think this is a good first step toward fixing this issue, feel free to merge.

@dvryaboy
Copy link
Contributor

I'm lost in layers of abstraction... :)

@julienledem
Copy link
Member Author

maybe I should throw in a monad or two.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems the Consumers here is just used as a namespace. So should it be a package name?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missed, it's providing some static util methods

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can make TypedConsumer generic
and the add method could be of generic type T and is defined in the super class which is TypedConsumer:

abstract class TypedConsumer<T> {
   abstract public void consume(T value)
}

advantage of this is removing the duplication of addXXX declaration for each XXConsumer

also after adding the Generic, the this.addXXX call can also be lifted to the super class. And each subclass only needs to define the T getValue method. The simplifies the code of TypedConsumer to only focus on 2 things:

  1. get the value from protocol
  2. call the consume method

So the code change would be:

abstract public static class TypedConsumer<T> {
final public void read(TProtocol protocol, EventBasedThriftReader reader) {
  this.consume(getValue(protocol))
} 

abstract public T getValue(TProtocol p) //This method should be implemented in concreate TypedConsumer
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I prefer the name to be consume, rather than add. It's just consumer <-> consume reads better than consumer<->add

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed a bunch of stuff according to your comment.
for the Generic TypedConsumer they actually don't all have the same signature (see Struct, Set, List, Map) so I don't think we should do it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems you missed this comment:

We can make TypedConsumer generic
and the add method could be of generic type T and is defined in the super class which is TypedConsumer:

abstract class TypedConsumer<T> {
   abstract public void consume(T value)
}

advantage of this is removing the duplication of addXXX declaration for each XXConsumer

also after adding the Generic, the this.addXXX call can also be lifted to the super class. And each subclass only needs to define the T getValue method. The simplifies the code of TypedConsumer to only focus on 2 things:

  1. get the value from protocol
  2. call the consume method

So the code change would be:

abstract public static class TypedConsumer<T> {
final public void read(TProtocol protocol, EventBasedThriftReader reader) {
  this.consume(getValue(protocol))
} 

abstract public T getValue(TProtocol p) //This method should be implemented in concreate TypedConsumer
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did reply to this one:
not all TypedConsumers do consume(T value). See {Set,List,Struct,Map}Consumer so that would not apply in that case.

@tsdeng
Copy link
Contributor

tsdeng commented Aug 29, 2014

LGTM +1

@julienledem
Copy link
Member Author

If you like it, please merge it.

@dvryaboy
Copy link
Contributor

I'll merge.

@asfgit asfgit closed this in addbbb9 Aug 30, 2014
@dvryaboy
Copy link
Contributor

@julienledem using this to read footers is a separate issue, correct? This doesn't seem to.

@julienledem
Copy link
Member Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants